Classical neural networks, including convolutional ones, suffer from two severe limitations:
Recurrent neural networks (RNNs) overcome these limitations by allowing the network to operate over sequences of vectors (in the input, in the output, or both).

RNNs have shown success in:
Training an RNN is similar to training a traditional neural network, but with some modifications. The main reason is that parameters are shared across all time-steps: in order to compute the gradient at $t=4$, we need to back-propagate 3 steps and sum up the gradients. This is called Backpropagation Through Time (BPTT).
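To make the gradient-summation idea concrete, here is a minimal BPTT sketch for a scalar RNN $h_t = \tanh(w\,h_{t-1} + u\,x_t)$ with a squared loss on the last hidden state; the toy data, the loss and the parameter values are chosen purely for illustration:

In [ ]:
import numpy as np

# Toy scalar RNN: h_t = tanh(w * h_{t-1} + u * x_t), loss on the last step only.
np.random.seed(0)
T = 4
x = np.random.randn(T)
y = 0.5                      # target for the final hidden state
w, u = 0.8, 1.2              # shared parameters (the same at every time-step)

# Forward pass, keeping every hidden state for the backward pass
h = np.zeros(T + 1)          # h[0] is the initial state
for t in range(1, T + 1):
    h[t] = np.tanh(w * h[t - 1] + u * x[t - 1])
loss = 0.5 * (h[T] - y) ** 2

# Backward pass (BPTT): walk back through time and *sum* the per-time-step
# contributions to the gradients of the shared parameters.
grad_w, grad_u = 0.0, 0.0
dh = h[T] - y                # dL/dh_T
for t in range(T, 0, -1):
    dpre = dh * (1 - h[t] ** 2)      # back through the tanh
    grad_w += dpre * h[t - 1]        # contribution of time-step t to dL/dw
    grad_u += dpre * x[t - 1]        # contribution of time-step t to dL/du
    dh = dpre * w                    # propagate to h_{t-1}

print(loss, grad_w, grad_u)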
The inputs of a recurrent network are always vectors, but we can process sequences of symbols/words by representing these symbols with numerical vectors.
Let's suppose we are classifying a series of words: $x_1, ..., x_{t-1}, x_t, x_{t+1}, ..., x_{T}$ are the word vectors corresponding to a corpus with $T$ symbols. Then, the relationship to compute the hidden layer output features at each time-step $t$ is

$$ h_t = \sigma(W^{(hh)} h_{t-1} + W^{(hx)} x_{t}) $$

where $W^{(hh)}$ is the recurrent weight matrix, $W^{(hx)}$ is the input weight matrix, and $\sigma$ is a non-linearity (typically tanh). At each time-step the network also produces a predicted probability distribution $\hat{y}_t$ over the vocabulary $V$, e.g. via a softmax over an affine transformation of $h_t$.

The loss function used in RNNs is often the cross entropy error:
$$ L^{(t)}(W) = - \sum_{j=1}^{|V|} y_{t,j} \log (\hat{y}_{t,j}) $$

The cross entropy error over a corpus of size $T$ is:

$$ L = \dfrac{1}{T} \sum_{t=1}^{T} L^{(t)}(W) = - \dfrac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{|V|} y_{t,j} \log (\hat{y}_{t,j}) $$

In the case of classifying a series of symbols/words, the perplexity measure can be used to assess the goodness of our model. It is basically 2 raised to the power of the cross entropy error, i.e., the average negative log probability that the model assigns to the correct words:

$$ Perplexity = 2^{L} $$

Perplexity is a measure of confusion: lower values imply more confidence in predicting the next word in the sequence (compared to the ground truth outcome).
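The following is a minimal NumPy sketch of these formulas for a toy vocabulary; the dimensions, the random parameters and the random targets are made up for illustration, and base-2 logarithms are used so that $2^{L}$ is exactly the perplexity:

In [ ]:
import numpy as np

np.random.seed(1)
V, d, H = 5, 4, 3            # toy vocabulary size, word-vector size, hidden size
T = 6                        # corpus length

# Random parameters and data, for illustration only
W_hh = np.random.randn(H, H) * 0.1
W_hx = np.random.randn(H, d) * 0.1
W_out = np.random.randn(V, H) * 0.1
x = np.random.randn(T, d)                    # word vectors
targets = np.random.randint(0, V, size=T)    # indices of the correct next words

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.zeros(H)
L = 0.0
for t in range(T):
    h = np.tanh(W_hh @ h + W_hx @ x[t])      # h_t = sigma(W^(hh) h_{t-1} + W^(hx) x_t)
    y_hat = softmax(W_out @ h)               # predicted distribution over the vocabulary
    L += -np.log2(y_hat[targets[t]])         # cross entropy of the correct word (base 2)
L /= T                                       # average over the corpus
print('cross entropy:', L, 'perplexity:', 2 ** L)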
Recurrent neural networks apply the same weight matrices at every time-step. Recall that the goal of an RNN implementation is to enable propagating context information through faraway time-steps. When this propagation results in a long series of matrix multiplications, gradients can vanish or explode.
Once the gradient value grows extremely large, it causes an overflow (i.e. NaN), which is easily detectable at runtime; this issue is called the Gradient Explosion Problem.
When the gradient value goes to zero, however, it can go undetected while drastically reducing the learning quality of the model for far-away words in the corpus; this issue is called the Vanishing Gradient Problem.
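A quick way to see where both problems come from is to repeatedly multiply a vector by the same matrix, which is (up to the non-linearity) what happens to the gradient as it is propagated back through many time-steps; the two matrices below are made up for illustration:

In [ ]:
import numpy as np

np.random.seed(2)
v = np.random.randn(10)

# Repeatedly multiplying by the same matrix, as in backpropagation through
# many time-steps, shrinks or blows up the signal depending on its scale.
W_small = np.random.randn(10, 10) * 0.05    # spectral radius well below 1
W_large = np.random.randn(10, 10) * 2.0     # spectral radius well above 1

g_small, g_large = v.copy(), v.copy()
for step in range(50):
    g_small = W_small @ g_small
    g_large = W_large @ g_large

print('norm after 50 steps (small weights):', np.linalg.norm(g_small))  # ~0 (vanishing)
print('norm after 50 steps (large weights):', np.linalg.norm(g_large))  # huge (exploding)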
To mitigate the problem of vanishing gradients, instead of initializing $W^{(hh)}$ with arbitrary random values, starting off from a random orthogonal matrix works better, i.e., a square matrix $W$ for which $W^T W=I$.
There are two properties of orthogonal matrices that are useful for training deep neural networks: (1) multiplying a vector by an orthogonal matrix leaves the vector's norm unchanged (all eigenvalues of $W$ have absolute value 1), and (2) the rows (and columns) of $W$ form an orthonormal set of vectors.
At least at the start of training, the first of these should help to keep the norm of the input constant throughout the network, which can help with the problem of exploding/vanishing gradients.
Similarly, an intuitive understanding of the second is that having orthonormal weight vectors encourages the weights to learn different input features.
You can obtain a random $n \times n$ orthogonal matrix $W$ (uniformly distributed) by performing a QR factorization of an $n \times n$ matrix with elements i.i.d. Gaussian random variables of mean $0$ and variance $1$. Here is an example:
In [ ]:
import numpy as np
from scipy.linalg import qr

n = 3
H = np.random.randn(n, n)   # n x n matrix with i.i.d. N(0, 1) entries
print(H)
print('\n')
Q, R = qr(H)                # QR factorization: Q is orthogonal
print(Q.dot(Q.T))           # should be (numerically close to) the identity matrix
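As a quick sanity check of the first property listed above, the snippet below builds a random orthogonal matrix in the same way and verifies that multiplying by it leaves the length of a vector unchanged (the vector v is just an arbitrary example):

In [ ]:
import numpy as np
from scipy.linalg import qr

n = 3
Q, _ = qr(np.random.randn(n, n))   # random orthogonal matrix, as above
v = np.random.randn(n)             # an arbitrary test vector

print(np.allclose(Q.dot(Q.T), np.eye(n)))            # True: Q is orthogonal
print(np.linalg.norm(v), np.linalg.norm(Q.dot(v)))   # equal norms: Q preserves length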
The most important types of gated RNNs are the Long Short-Term Memory (LSTM) network and the Gated Recurrent Unit (GRU).
The key idea of LSTMs is the cell state $C$, which is carried along from one time-step to the next.
The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.
The LSTM has the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation.
Let us see how a LSTM uses $h_{t-1}, C_{t-1}$ and $x_{t}$ to generate the next hidden states $C_t, h_{t}$:
$$ f_t = \sigma(W_f \cdot [h_{t-1}, x_t]) \mbox{ (Forget gate)} $$
$$ i_t = \sigma(W_i \cdot [h_{t-1}, x_t]) \mbox{ (Input gate)} $$
$$ \tilde C_t = \operatorname{tanh}(W_C \cdot [h_{t-1}, x_t]) \mbox{ (Candidate cell state)} $$
$$ C_t = f_t * C_{t-1} + i_t * \tilde C_t \mbox{ (Cell state update)} $$
$$ o_t = \sigma(W_o \cdot [h_{t-1}, x_t]) \mbox{ (Output gate)} $$
$$ h_t = o_t * \operatorname{tanh}(C_t) \mbox{ (Hidden state)} $$

There are other variants of LSTM (e.g. the LSTM with peephole connections of Gers & Schmidhuber (2000)).
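To make the equations concrete, here is a minimal NumPy sketch of a single LSTM step following the formulas above; the dimensions and the random weights are illustrative, and bias terms are omitted as in the equations:

In [ ]:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(3)
n_hidden, n_input = 4, 3
shape = (n_hidden, n_hidden + n_input)                   # each gate sees [h_{t-1}, x_t]
W_f, W_i, W_C, W_o = [np.random.randn(*shape) * 0.1 for _ in range(4)]

def lstm_step(h_prev, C_prev, x_t):
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z)                   # forget gate
    i_t = sigmoid(W_i @ z)                   # input gate
    C_tilde = np.tanh(W_C @ z)               # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde       # cell state update
    o_t = sigmoid(W_o @ z)                   # output gate
    h_t = o_t * np.tanh(C_t)                 # hidden state
    return h_t, C_t

h, C = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in np.random.randn(5, n_input):      # a toy sequence of 5 input vectors
    h, C = lstm_step(h, C, x_t)
print(h, C)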
The transition from hidden state $h_{t-1}$ to $h_{t}$ in a vanilla RNN is defined by an affine transformation followed by a point-wise nonlinearity.
What motivates the use of gated units? Although RNNs can theoretically capture long-term dependencies, they are very hard to actually train to do this. Gated recurrent units are designed in a manner to have more persistent memory thereby making it easier for RNNs to capture long-term dependencies.
Let us see how a GRU uses $h_{t-1}$ and $x_{t}$ to generate the next hidden state $h_{t}$.
$$ z_{t} = \sigma(W_z \cdot [x_{t}, h_{t-1}]) \mbox{ (Update gate)} $$
$$ r_{t} = \sigma(W_r \cdot [x_{t}, h_{t-1}]) \mbox{ (Reset gate)} $$
$$ \tilde{h}_{t} = \operatorname{tanh}(W_h \cdot [x_{t}, r_t \circ h_{t-1}]) \mbox{ (New memory)} $$
$$ h_{t} = (1 - z_{t}) \circ \tilde{h}_{t} + z_{t} \circ h_{t-1} \mbox{ (Hidden state)} $$

The GRU combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models.
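Analogously to the LSTM sketch above, a single GRU step can be written in a few lines of NumPy following these equations; the dimensions and the random weights are again illustrative and bias terms are omitted:

In [ ]:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(4)
n_hidden, n_input = 4, 3
shape = (n_hidden, n_input + n_hidden)                   # each gate sees [x_t, h_{t-1}]
W_z, W_r, W_h = [np.random.randn(*shape) * 0.1 for _ in range(3)]

def gru_step(h_prev, x_t):
    z_t = sigmoid(W_z @ np.concatenate([x_t, h_prev]))            # update gate
    r_t = sigmoid(W_r @ np.concatenate([x_t, h_prev]))            # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([x_t, r_t * h_prev]))  # new memory
    return (1 - z_t) * h_tilde + z_t * h_prev                     # hidden state

h = np.zeros(n_hidden)
for x_t in np.random.randn(5, n_input):      # a toy sequence of 5 input vectors
    h = gru_step(h, x_t)
print(h)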
Whenever you train or test your LSTM/GRU, you first have to build your input matrix $X$ of shape (nb_samples, timesteps, input_dim), where your batch size divides nb_samples.

For instance, if nb_samples=1024 and batch_size=64, your model will receive blocks of 64 samples, compute each output (whatever the number of timesteps is for every sample), average the gradients and propagate them to update the parameter vector.
By default, Keras shuffles (permutes) the samples in $X$ and the dependencies between $X_i$ and $X_{i+1}$ are lost.
With the stateful model, all the states are propagated to the next batch. It means that the state of the sample located at index $i$, $X_i$, will be used in the computation of the sample $X_{i+bs}$ in the next batch, where $bs$ is the batch size (no shuffling).
Keras requires you to specify the batch size in stateful mode and to set shuffle=False.
In [ ]:
'''Example script showing how to use stateful RNNs
to model long sequences efficiently.
'''
from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers.core import Dense
from keras.layers.recurrent import LSTM, GRU
# since we are using stateful rnn tsteps can be set to 1
tsteps = 1
batch_size = 25
# number of elements ahead that are used to make the prediction
lahead = 1
def gen_cosine_amp(amp=100, period=25, x0=0, xn=50000, step=1, k=0.0001):
    """Generates an absolute cosine time series with the amplitude
    exponentially decreasing

    Arguments:
        amp: amplitude of the cosine function
        period: period of the cosine function
        x0: initial x of the time series
        xn: final x of the time series
        step: step of the time series discretization
        k: exponential rate
    """
    cos = np.zeros(((xn - x0) * step, 1, 1))
    for i in range(len(cos)):
        idx = x0 + i * step
        cos[i, 0, 0] = amp * np.cos(idx / (2 * np.pi * period))
        cos[i, 0, 0] = cos[i, 0, 0] * np.exp(-k * idx)
    return cos

print('Generating Data')
cos = gen_cosine_amp()
print('Input shape:', cos.shape)

expected_output = np.zeros((len(cos), 1))
for i in range(len(cos) - lahead):
    expected_output[i, 0] = np.mean(cos[i + 1:i + lahead + 1])
print('Output shape:', expected_output.shape)
print('Sample:', cos[0], expected_output[0])
plt.subplot(2, 1, 1)
plt.plot(expected_output)
plt.title('Expected')
plt.show()
In [ ]:
epochs = 15
print('Creating Model')
model = Sequential()
model.add(LSTM(50,
               batch_input_shape=(batch_size, tsteps, 1),
               return_sequences=True,
               stateful=True))
model.add(LSTM(50,
               batch_input_shape=(batch_size, tsteps, 1),
               return_sequences=False,
               stateful=True))
model.add(Dense(1))
model.compile(loss='mse', optimizer='rmsprop')

print('Training')
for i in range(epochs):
    print('Epoch', i, '/', epochs)
    model.fit(cos,
              expected_output,
              batch_size=batch_size,
              verbose=1,
              nb_epoch=1,
              shuffle=False)
    # reset the LSTM states after each full pass over the sequence
    model.reset_states()
print('Predicting')
predicted_output = model.predict(cos, batch_size=batch_size)
print('Plotting Results')
plt.subplot(2, 1, 1)
plt.plot(predicted_output)
plt.title('Predicted')
plt.show()